
This paper describes an empirical study of the confidence that can be placed in reported speedups (or slowdowns). The problem arises in all empirical research, but it is especially important in compiler research, where reported speedups are usually smaller than in other areas. Moreover, because compilers must optimize code for many different kinds of applications, another major concern is the set of inputs used to evaluate the improvement achieved by a transformation: not only the size of the inputs, but above all the type of input and the program behavior it is expected to trigger. The main issue, however, lies in the methodology commonly applied in empirical research on compiler systems: a single run for training and a single run for testing each program.

Research in compiler transformations often demonstrates heroic efforts in both the identification and abstract analysis of opportunities to improve program efficiency, and in the concrete implementation of these ideas. However, standard practices at the evaluation stage of the scientific process are modest at best, perhaps because code transformations have a long history of providing significant benefits in practical, everyday situations. In most cases, compilers are evaluated using a collection of programs, with each program evaluated using a timing run on a single evaluation input.

The deficiencies of this evaluation process are particularly prevalent, and especially disconcerting, when {\it feedback-directed optimization} (\FDO) is used to guide a transformation.  In this scenario, instrumentation is inserted into the program during an initial compilation in order to collect a profile of the run-time behavior of the program during one or more training runs.  The profile is used in a second compilation of the program to help the compiler assess the benefit of code transformation opportunities.  

The current standard practice for evaluating an \FDO\ compiler uses the profile of a single training input to guide transformations, and evaluates the transformed program with a single evaluation input. These standard practices treat program inputs as controlled variables. However, a performance evaluation should generalize to real-world program workloads. Consequently, the program-input dimensions of a rigorous evaluation of compiler performance must be manipulated variables.

%===================== FDI

Previous work has not addressed the problem of representing and utilizing multi-run profiles.  An \FDO\ compiler should not simply add or average profiles from multiple runs, because such a profile does not provide any information about the variations in program behaviors observed between different inputs. Berube uses {\it Combined Profiling} (\CP) to merge the profiles from multiple runs into a distribution model that allows code transformations to consider cross-run behavior variations~\cite{BerubePhD}.  Experimental results demonstrate that meaningful behavior variation is present in the program workloads, and that this variation is successfully captured and represented by the \CP\ methodology.
%===================== single-run issue
%Although the usual way to do research in Feedback Directed Optimization is to perform a single-run input training and single data testing, 
More recently, other approaches to this problem have been developed.
Their main goal is to perform multiple runs over multiple inputs, because questions have arisen about the single-run approach: is it accurate, appropriate, and reliable?

Recent work~\cite{Kalibera2013} states that execution time is a key measurement, used for example in $90$ out of $122$ papers presented in 2011 at PLDI, ASPLOS, and ISMM, or published in TOPLAS and TACO. As reported by Kalibera and Jones, the overwhelming majority of these papers presented results that were either impossible to repeat or did not convincingly demonstrate their performance claims, because no measure of variation was given for the results~\cite{Kalibera2013}. This research also focuses on execution time, and shows the need for a methodology that allows the researcher to control measurement error, or at least to provide sufficient evidence of a performance improvement.

%===================== Inlining and Case study

There have been some recent efforts to apply multiple profiles to \FDO, and also to evaluate the performance of a program over multiple inputs. The \CP\ methodology addresses this problem and can be applied to many different optimization techniques, such as inlining and loop unrolling. In this research, \CP\ is applied to inlining as a case study, because inlining enables many other optimization techniques to be performed afterwards.

This paper discusses these issues by constructing a ``false'' speedup from actual data: we deliberately ignore our multiple-run strategy and pick out parts of the collected data to show that many different results are possible under a single-run scheme. We also point out that a ``false'' slowdown can be picked from the same data. These constructions reinforce the case for multiple-run methodologies.
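
As a minimal sketch of how such a selection can arise (the notation here is an assumption introduced only for illustration, and it abstracts away the choice of training input), let $b_1,\dots,b_n$ be execution times measured for the baseline program and $t_1,\dots,t_m$ those measured for the transformed program. A single-run evaluation reports one ratio
\[
  s_{ij} = \frac{b_i}{t_j},
\]
and, depending on which pair $(i,j)$ happens to be measured, this ratio can lie anywhere in the range
\[
  \frac{\min_i b_i}{\max_j t_j} \;\le\; s_{ij} \;\le\; \frac{\max_i b_i}{\min_j t_j}.
\]
Whenever the two sets of measurements overlap, that is, $\min_i b_i < \max_j t_j$ and $\min_j t_j < \max_i b_i$, the lower bound is below $1$ and the upper bound is above $1$, so both an apparent speedup and an apparent slowdown can be reported from the same data.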

%===================== Questions

Several open questions about the use of profiles collected from multiple runs of a program were addressed and assessed in \cite{BerubeISPASS12}. Some questions remain, however, once multiple profiles are combined. What is the impact of \CP\ in a controlled case study? Can \FDO\ decisions be more accurate using \CP\ instead of single-run evaluation?

This paper addresses these questions through a case study of the \CP\ process. As already mentioned, the case study applies \CP\ to inlining, and compares the \CP\ process with the single-run process. The application of \CP\ to other situations with multiple profiling instances, such as profiling program phases individually, is outside the scope of this paper.

%===================== Contributions

The main contributions of this paper are:
\begin{itemize}
\item {\it Methodological considerations.} The behavior of single runs and \CP\ runs is compared and analyzed. We show that single-run methodologies are error-prone.

\item {\it Case studies.} The cases studied illustrate that the single-run methodology can lead the researcher into serious errors, and that a methodology like \CP\ is better suited to evaluating performance.

\end{itemize}

%===================== Structure

This paper has eight sections. This introduction poses the research problem and presents the main ideas. The second section describes the inlining transformation, and the third describes the problem and the overall setting of this research. The fourth section presents the ``speedup'', and also notes a ``slowdown'' for the same problem. The fifth section analyzes the environment and provides sufficient statistical information to explain what happened in the previous section, as well as what may happen in other experiments that use the same methodology. Building on this data analysis, the sixth section shows how the problem can be avoided by means of the \CP\ methodology. The paper ends with a discussion of related work and the conclusion.
